Identifying instructions from chat messages in the Radiation Response Game.
In this notebook, we explore classification performance when using the provenance of a data entity instead of its dependencies (as shown here and in the paper). To distinguish between the two, we call the former historical provenance and the latter forward provenance. Apart from using the historical provenance, all other steps are the same as in the original experiments.
The RRG dataset based on historical provenance is provided in the rrg/ancestor-graphs.csv file, which contains a table whose rows correspond to individual chat messages in RRG:
label: the manual classification of the message (e.g., instruction, information, requests, etc.)
Note that in this extra experiment, we use the full (historical) provenance of a message, not limiting how far it goes. Hence, there is no $k$ parameter in this experiment.
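To make the distinction between historical and forward provenance concrete, here is a minimal sketch on a made-up derivation graph. The node names, the `reachable` helper, and its `max_depth` argument are all hypothetical and not part of the RRG dataset or the original code; `max_depth` only illustrates the kind of limit that a parameter such as $k$ would presumably impose, and which this experiment does not use.

```python
from collections import deque

def reachable(start, edges, max_depth=None):
    """Return every node reachable from `start` by following `edges`.

    `edges` maps a node to the list of its direct neighbours.
    `max_depth` (hypothetical) caps the number of hops; None means no limit.
    """
    seen = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the allowed number of hops
        for neighbour in edges.get(node, ()):
            if neighbour != start and neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen

# Hypothetical toy graph: an edge a -> b means "b was derived from a".
derived = {"m1": ["m2"], "m2": ["m3", "m4"], "m3": ["m5"]}

# Reverse the edges so that we can walk back towards the sources.
sources = {}
for src, targets in derived.items():
    for tgt in targets:
        sources.setdefault(tgt, []).append(src)

historical = reachable("m3", sources)               # everything m3 came from
forward = reachable("m2", derived)                  # everything that used m2
forward_1 = reachable("m2", derived, max_depth=1)   # direct uses of m2 only
```

In this toy graph, the historical provenance of `m3` is walked without any depth limit, which mirrors the setting of this experiment.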
In [1]:
import pandas as pd
In [2]:
filepath = "rrg/ancestor-graphs.csv"
In [3]:
df = pd.read_csv(filepath, index_col=0)
df.head()
Out[3]:
In [4]:
# Binarize the labels: keep 'instruction' and map every other label to 'other'
label = lambda l: 'other' if l != 'instruction' else l
In [5]:
df.label = df.label.apply(label).astype('category')
df.head()
Out[5]:
In [6]:
# Examine the balance of the dataset
df.label.value_counts()
Out[6]:
Since both labels have roughly the same number of data points, we decide not to balance the RRG datasets.
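As a rough illustration of how such a balance check can be quantified, the sketch below computes class proportions and an imbalance ratio with pandas. The toy labels are made up for the example and are not the RRG counts.

```python
import pandas as pd

# Hypothetical toy labels standing in for df.label (not the real RRG data)
labels = pd.Series(["instruction", "other", "other", "instruction", "other"])

# Share of each class instead of raw counts
proportions = labels.value_counts(normalize=True)

# Ratio between the largest and the smallest class; values close to 1
# indicate a roughly balanced dataset
imbalance_ratio = proportions.max() / proportions.min()
```

Applied to the notebook's data, the same two calls on `df.label` would give the class shares behind the decision not to rebalance.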
We now run the cross validation tests on the datasets using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). Please refer to Cross Validation Code.ipynb for the detailed description of the cross validation code.
In [7]:
from analytics import test_classification
In [8]:
results, importances = test_classification(df, n_iterations=1000)
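The internals of test_classification are not shown in this notebook. As a rough idea of what comparing the three feature sets could look like, here is a minimal sketch using scikit-learn. Everything in it is an assumption made for illustration: the column names, the random toy data, the `mean_cv_accuracy` helper, the choice of a decision tree, and the 10-fold setup are not taken from the analytics module.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200

# Random toy data with hypothetical column names (not the RRG features)
toy = pd.DataFrame({
    "nodes": rng.integers(1, 50, n),      # assumed generic network metric
    "edges": rng.integers(1, 100, n),     # assumed generic network metric
    "entities": rng.integers(1, 30, n),   # assumed provenance-specific metric
    "agents": rng.integers(1, 5, n),      # assumed provenance-specific metric
    "label": rng.choice(["instruction", "other"], n),
})

feature_sets = {
    "combined": ["nodes", "edges", "entities", "agents"],
    "generic": ["nodes", "edges"],
    "provenance": ["entities", "agents"],
}

def mean_cv_accuracy(data, columns, n_iterations=3):
    """Average accuracy over repeated, reshuffled 10-fold cross validation."""
    scores = []
    for seed in range(n_iterations):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        clf = DecisionTreeClassifier(random_state=seed)  # stand-in classifier
        scores.extend(
            cross_val_score(clf, data[columns], data["label"],
                            cv=cv, scoring="accuracy")
        )
    return float(np.mean(scores))

accuracies = {name: mean_cv_accuracy(toy, cols)
              for name, cols in feature_sets.items()}
```

Because the toy labels are random, the resulting accuracies carry no meaning; the sketch only shows the shape of a per-feature-set comparison.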
Results: Compared to the top accuracy of 85% achieved using forward provenance, using historical provenance in this application yields a much lower accuracy of 66%. This supports our hypothesis that the forward provenance of a data entity correlates better with its nature/characteristics than its historical provenance does (as the forward provenance records how the data entity was used).